First Step: loading the libraries

First of all, we will clean the workspace and load all the necessary libraries:

rm(list=ls()) 
library(naniar)
library(mice)
library(tidyverse)
library(factoextra)
library(GGally)
library(xfun)
library(rmarkdown)
library(dplyr)
library(ggplot2)
library(cluster)
library(dendextend)
library(psych)
library(tidyr)
library(kernlab)
library(mclust)
library(NbClust)
library(Amelia)
library(corrplot)
library(VIM)

Introduction:

The dataset we are going to use for this project is called HR Analytics: Job Change of Data Scientists. In short, a company hiring data scientists stores data on the most promising profiles, since it wants to reduce training time and costs in the long run. The dataset is designed to understand which factors lead a person to leave their current job: the company wants to hire the candidate most likely to stay, based on the stored data. The data can be downloaded from Kaggle.

For this particular project, we are going to use the train split, since I consider it more valuable and richer in variables than the test one (for instance, it includes the target column).

setwd("~/UC3M/2nd year/statistical learning/archive")

data = read.csv("aug_train.csv", header = T, na.strings = "")
View(data)
head(data)
##   enrollee_id     city city_development_index gender     relevent_experience
## 1        8949 city_103                  0.920   Male Has relevent experience
## 2       29725  city_40                  0.776   Male  No relevent experience
## 3       11561  city_21                  0.624   <NA>  No relevent experience
## 4       33241 city_115                  0.789   <NA>  No relevent experience
## 5         666 city_162                  0.767   Male Has relevent experience
## 6       21651 city_176                  0.764   <NA> Has relevent experience
##   enrolled_university education_level major_discipline experience company_size
## 1       no_enrollment        Graduate             STEM        >20         <NA>
## 2       no_enrollment        Graduate             STEM         15        50-99
## 3    Full time course        Graduate             STEM          5         <NA>
## 4                <NA>        Graduate  Business Degree         <1         <NA>
## 5       no_enrollment         Masters             STEM        >20        50-99
## 6    Part time course        Graduate             STEM         11         <NA>
##     company_type last_new_job training_hours target
## 1           <NA>            1             36      1
## 2        Pvt Ltd           >4             47      0
## 3           <NA>        never             83      0
## 4        Pvt Ltd        never             52      1
## 5 Funded Startup            4              8      0
## 6           <NA>            1             24      1

A deeper look into the database

The features in the database are the following (they are mostly self-explanatory):

  • enrollee_id

  • city

  • city_development_index -> development index of the candidate's city

  • gender

  • relevent_experience

  • enrolled_university

  • education_level

  • major_discipline

  • experience

  • company_size

  • company_type

  • last_new_job

  • training_hours

  • target -> (0) not looking for a job change, (1) looking for a job change

Now, let’s dive into the dataset.

str(data)
## 'data.frame':    19158 obs. of  14 variables:
##  $ enrollee_id           : int  8949 29725 11561 33241 666 21651 28806 402 27107 699 ...
##  $ city                  : chr  "city_103" "city_40" "city_21" "city_115" ...
##  $ city_development_index: num  0.92 0.776 0.624 0.789 0.767 0.764 0.92 0.762 0.92 0.92 ...
##  $ gender                : chr  "Male" "Male" NA NA ...
##  $ relevent_experience   : chr  "Has relevent experience" "No relevent experience" "No relevent experience" "No relevent experience" ...
##  $ enrolled_university   : chr  "no_enrollment" "no_enrollment" "Full time course" NA ...
##  $ education_level       : chr  "Graduate" "Graduate" "Graduate" "Graduate" ...
##  $ major_discipline      : chr  "STEM" "STEM" "STEM" "Business Degree" ...
##  $ experience            : chr  ">20" "15" "5" "<1" ...
##  $ company_size          : chr  NA "50-99" NA NA ...
##  $ company_type          : chr  NA "Pvt Ltd" NA "Pvt Ltd" ...
##  $ last_new_job          : chr  "1" ">4" "never" "never" ...
##  $ training_hours        : int  36 47 83 52 8 24 24 18 46 123 ...
##  $ target                : num  1 0 0 1 0 1 0 1 1 0 ...

We are going to convert most of the variables to factors so we can work with them more easily.

data$city = as.factor(data$city)
data$gender = as.factor(data$gender)
data$relevent_experience = as.factor(data$relevent_experience)
data$enrolled_university = as.factor(data$enrolled_university)
data$education_level  = as.factor(data$education_level)
data$major_discipline = as.factor(data$major_discipline)
data$experience = as.factor(data$experience)
data$company_size = as.factor(data$company_size)
data$last_new_job = as.factor(data$last_new_job)
data$training_hours = as.integer(data$training_hours)
data$target = as.factor(data$target)
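As a side note, the repeated as.factor() calls above can be written more compactly with dplyr's across() helper; a minimal sketch on a toy data frame (the real dataset would work the same way):

```r
library(dplyr)

# Toy data frame standing in for the real dataset
toy <- data.frame(gender          = c("Male", "Female"),
                  education_level = c("Graduate", "Masters"),
                  training_hours  = c(36L, 47L))

# Convert every character column to a factor in a single call
toy <- toy %>% mutate(across(where(is.character), as.factor))

str(toy)
```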

Some values of the company_size variable are not in the right format (a slash instead of a dash), so let's fix that. Note that gsub() returns a character vector, so we convert back to factor afterwards.

data$company_size = as.factor(gsub("/", "-", data$company_size))

Data preprocessing:

Let's count our NAs and plot them so we can inspect them and decide how to fix them.

sum(is.na(data))
## [1] 20733

We have a lot of missing values! Let’s see them graphically.

hist(rowMeans(is.na(data)))

barplot(colMeans(is.na(data)), las=2)

aggr(data, numbers = TRUE, sortVars = TRUE, labels = names(data),
     cex.axis = .7, gap = 1, ylab= c('Missing data','Pattern'))

## 
##  Variables sorted by number of missings: 
##                Variable       Count
##            company_type 0.320492745
##            company_size 0.309948846
##                  gender 0.235306399
##        major_discipline 0.146831611
##         education_level 0.024010857
##            last_new_job 0.022079549
##     enrolled_university 0.020148241
##              experience 0.003392839
##             enrollee_id 0.000000000
##                    city 0.000000000
##  city_development_index 0.000000000
##     relevent_experience 0.000000000
##          training_hours 0.000000000
##                  target 0.000000000

This confirms what we already knew: there are a lot of missing values across many different variables!

gg_miss_upset(data)

The upset plot shows the combinations of missing values: by default it takes the 5 variables with the most missings and orders the intersections by size. The largest intersections involve company_size and company_type, the variables with the most NAs, while education_level contributes far fewer. Let's count exactly how many NAs each variable has and plot it.

gender = sum(is.na(data$gender))
en_uni = sum(is.na(data$enrolled_university))
ed_lvl = sum(is.na(data$education_level))
m_disc = sum(is.na(data$major_discipline))
exp = sum(is.na(data$experience))
com_size = sum(is.na(data$company_size))
com_type = sum(is.na(data$company_type))
job = sum(is.na(data$last_new_job))
null_df = data.frame(variable=c("enrollee_id", "city", "gender",
                                "city_development_index", "relevent_experience","enrolled_university","education_level",
                                "major_discipline", "experience", "company_size", "company_type", "last_new_job",
                                "training_hours", "target"),
total_null = c(0, 0, gender, 0, 0, en_uni, ed_lvl, m_disc, exp, com_size, com_type, job, 0, 0))

null_df
##                  variable total_null
## 1             enrollee_id          0
## 2                    city          0
## 3                  gender       4508
## 4  city_development_index          0
## 5     relevent_experience          0
## 6     enrolled_university        386
## 7         education_level        460
## 8        major_discipline       2813
## 9              experience         65
## 10           company_size       5938
## 11           company_type       6140
## 12           last_new_job        423
## 13         training_hours          0
## 14                 target          0
null_df$variable = factor(null_df$variable,
                          levels = null_df$variable[order(null_df$total_null,decreasing = TRUE)])

ggplot(null_df,aes(x=variable,y=total_null))+ geom_bar(stat = "identity", col = "#99FF99", fill = "#CCFF99")+ theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  geom_label(aes(label = total_null, size = NULL), nudge_y = 0.6)+
  theme(plot.title = element_text(hjust = 0.6))+
  ggtitle("Total NAs by column")+
  xlab("Variable")
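As an alternative to hand-typing the counts, the same table can be built programmatically with colSums(is.na(...)), which avoids transcription errors if the data changes; a sketch on a toy data frame:

```r
# Toy data frame with some NAs, standing in for the real dataset
toy <- data.frame(gender     = c("Male", NA, "Female", NA),
                  experience = c(">20", "5", NA, "10"),
                  target     = c(1, 0, 0, 1))

# One NA count per column, in the same shape as null_df above
null_df <- data.frame(variable   = names(toy),
                      total_null = colSums(is.na(toy)),
                      row.names  = NULL)
null_df
```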

We also ran missmap() and got about 12% missing observations. Let's fix this by replacing each NA with the mean of the variable (for numeric columns) or its most frequent value (for categorical ones). Otherwise we would lose a lot of valuable information before even starting the unsupervised learning, and more sophisticated imputation was too time- and power-consuming.

data$city_development_index[which(is.na(data$city_development_index))] = mean(data$city_development_index, na.rm = TRUE)
data$gender[which(is.na(data$gender))] = "Male"
data$company_size[which(is.na(data$company_size))] = "50-99"
data$relevent_experience[which(is.na(data$relevent_experience))] = "No relevent experience"
data$enrolled_university[which(is.na(data$enrolled_university))] = "no_enrollment"
data$education_level[which(is.na(data$education_level))] = "Graduate"
data$major_discipline[which(is.na(data$major_discipline))] = "STEM"
data$experience[which(is.na(data$experience))] = ">20"
data$last_new_job[which(is.na(data$last_new_job))] = 1
data$training_hours[which(is.na(data$training_hours))] = mean(data$training_hours, na.rm = TRUE)
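The imputation values above ("Male", "50-99", "Graduate", ...) are hard-coded modes; they could also be computed, so the code does not silently break if the data changes. A minimal sketch of a mode-imputation helper (the function name is ours, not from any package):

```r
# Replace NAs in a vector with its most frequent non-NA value (the mode)
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))  # table() ignores NAs by default
  x[is.na(x)] <- mode_val
  x
}

gender <- c("Male", "Male", "Female", NA, "Male", NA)
impute_mode(gender)
```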

sum(data$target == 0)
## [1] 14381
sum(data$target == 1)
## [1] 4777

The target variable has no missing values, so nothing needs to be imputed there; note that the classes are imbalanced, with roughly 75% of candidates not looking for a job change.

Now, we are going to remove the variables we consider not relevant: enrollee_id, city and company_type. We then check that there are no NAs left.

data = select(data,-enrollee_id, -city,-company_type)
sum(is.na(data))
## [1] 0

Visualization: getting insights before applying the tools

Let's get some insight into what is going on in our data. First of all, we can see that there are far more men than any other gender (about 92%).

sum(data$gender == "Female")
## [1] 1238
sum(data$gender == "Male")
## [1] 17729
sum(data$gender == "Other")
## [1] 191
ggplot(data, aes(x=major_discipline))+
  geom_bar(fill = "#58A4B0")+
  geom_text(stat='count', aes(label=..count..), vjust=-1)+
  ggtitle("Major Distribution")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Major")

We have way more people in STEM than any other Major, by far as seen above (90.32%).

I also made a plot of majors split by gender, but it was not informative: since our data is mostly men, any split by gender just tells us that there are more men.

ggplot(data,aes(x=relevent_experience))+
  geom_bar(fill = "#373F51")+
  facet_wrap(~education_level)+
  ggtitle("Relevent experience by education level")+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Experience")+
  ylab("Count")

We appreciate that those with the most relevant experience are graduates from university, followed by those with a master’s degree.

ggplot(data, aes(x=major_discipline == "STEM"))+
  geom_bar(fill = "#58A4B0")+
  geom_text(stat='count', aes(label=..count..), vjust=-1)+
  ggtitle("Major Distribution")+
  facet_wrap(~enrolled_university)+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Enrollment in STEM")

We know that 73% of our candidates were not enrolled in university, so it follows that the biggest group is STEM majors with no enrollment. Among those who are enrolled, the full-time course is more common than the part-time one.

ggplot(data,aes(x= training_hours,fill = relevent_experience))+
  geom_density(alpha = 0.5)

ggplot(data,aes(x= training_hours,fill = relevent_experience))+
  geom_density(alpha = 0.5) + facet_wrap(~relevent_experience)

ggplot(data,aes(x= training_hours,fill = relevent_experience))+
  geom_density(alpha = 0.5)+ xlim(10,70)

Training hours completed are basically the same whether or not the candidate has relevant experience, and the bulk of the distribution sits around 20 to 30 hours.

ggplot(data, aes(x=city_development_index, fill=relevent_experience)) +
  geom_density(adjust=1.5, alpha=.4) +
  theme_light()

Candidates with relevant experience mostly come from cities with a high development index, though many without experience do too. Around a development index of 0.6 there is a small peak of candidates with relevant experience, but most candidates come from cities with an index near 0.9.

ggplot(data, aes(x=training_hours, group=education_level, fill=education_level)) +
  geom_density(adjust=1.5, alpha = 0.5) +
  theme_bw() + facet_wrap(~education_level)

This time, just like with relevant experience, training hours seem independent of the candidate's level of education: the distributions are mostly alike.

ggplot(data,aes(x=education_level,fill = education_level))+
  geom_bar()+
  facet_wrap(~target)+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  ggtitle("Target by education level")+
  xlab("Target")+
  ylab("Count")

Graduates are both the group most likely to want to leave their current job and the most likely to want to stay (they simply dominate the sample), so let's break it down by years of experience.

ggplot(data,aes(x=experience,fill = experience))+
  geom_bar()+
  facet_wrap(~target)+
  geom_text(stat='count', aes(label=..count..), vjust=-1,check_overlap = TRUE)+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x = element_text(angle = 90, hjust =1, vjust = 0.5))+
  ggtitle("Target by years of experience")+
  xlab("Target")+
  ylab("Count")

Interestingly, the ones most likely to want to change their current job are those with more than 20 years of experience and those with 4 to 6 years of experience.

Principal Component Analysis (PCA):

Let's convert the remaining variables to numeric first (keep in mind that coercing factors to numeric uses their level codes, which imposes an arbitrary ordering on categorical variables):

data$relevent_experience = as.numeric(data$relevent_experience)
data$gender = as.numeric(data$gender)
data$experience = as.numeric(data$experience)
data$last_new_job = as.numeric(data$last_new_job)
data$training_hours = as.numeric(data$training_hours)
data$target = as.numeric(data$target)
data$education_level = as.numeric(data$education_level)
data$major_discipline = as.numeric(data$major_discipline)
data$enrolled_university = as.numeric(data$enrolled_university)
data$company_size = as.numeric(as.factor(data$company_size))

Note that city and enrollee_id were already removed above, so they are skipped here.
corrplot(cor(data), is.corr = F,
         number.cex = 0.4, tl.cex = 0.4, cl.cex = 0.4)

ggcorr(data, label = T)

There are no really strong correlations between the variables, but let's keep going anyway.

pca1 = prcomp(data, scale=T)
summary(pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion  0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion  0.81371 0.88656 0.94642 1.00000

We obtain a table with all the components, but a plot gives a better visual insight.

screeplot(pca1,main="Screeplot",col="#aC5E99",type="barplot",pch=19)

But it is nicer and much more informative if we use factoextra:

fviz_screeplot(pca1, addlabels = TRUE)

We see that we need most of the components to explain the variance, since the first one barely explains 17% of it.
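The same conclusion can be read off numerically from the cumulative proportion of variance; a sketch on synthetic data (the variables and the 80% threshold here are illustrative, not from the report):

```r
set.seed(1)
# Synthetic stand-in: 200 observations, 5 variables, two of them correlated
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)

pca <- prcomp(x, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative proportion of variance
n_comp <- which(cum_var >= 0.80)[1]              # components needed to reach 80%
cum_var
n_comp
```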

fviz_pca_ind(pca1, geom.ind = "point", col.ind = "#aC5E99", 
             axes = c(1, 2), pointsize = 1.5, repel = T) 

What would happen if we also centered the data explicitly?

pca2 = prcomp(data, center = T, scale. = T); summary(pca2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3516 1.1289 1.04181 1.01213 0.99899 0.98635 0.94746
## Proportion of Variance 0.1661 0.1159 0.09867 0.09313 0.09073 0.08844 0.08161
## Cumulative Proportion  0.1661 0.2819 0.38060 0.47373 0.56445 0.65290 0.73450
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.93341 0.89518 0.81149 0.76768
## Proportion of Variance 0.07921 0.07285 0.05987 0.05358
## Cumulative Proportion  0.81371 0.88656 0.94642 1.00000
fviz_screeplot(pca2, addlabels = TRUE)

fviz_pca_var(pca2, col.var = "cos2", geom.var = "arrow", 
             labelsize = 2, repel = T)

barplot(pca1$rotation[,1], las=2, col="#aC5E99")

barplot(pca2$rotation[,1], las=2, col="#58A4B0")

Both are identical and explain the same variance either way: prcomp() centers by default, and pca1 was already scaled.

fviz_contrib(pca1, choice = "var", axes = 2)

Here we see each variable's contribution to the second principal component: city_development_index contributes the most, followed by education_level and relevent_experience.

head(get_pca_ind(pca1)$contrib[,1])
##            1            2            3            4            5            6 
## 3.777108e-04 2.202977e-04 2.923932e-02 5.666895e-03 3.247206e-05 9.915507e-04

The individual contribution indices are really low, as expected with almost 20,000 observations sharing the total.

Factor Analysis:

Factor Analysis and PCA are quite similar: both describe the variability in our data. FA is a technique to reduce a large number of variables to a smaller number of latent factors. The biggest difference is that FA assumes a linear latent-variable model, and the maximum-likelihood fit additionally assumes normality. Let's use 3 factors as a default.

factosys = factanal(data,factors = 3, rotation = "none", scores = "regression")
factosys
## 
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "none")
## 
## Uniquenesses:
## city_development_index                 gender    relevent_experience 
##                  0.685                  0.998                  0.390 
##    enrolled_university        education_level       major_discipline 
##                  0.834                  0.973                  0.948 
##             experience           company_size           last_new_job 
##                  0.860                  0.931                  0.832 
##         training_hours                 target 
##                  0.999                  0.005 
## 
## Loadings:
##                        Factor1 Factor2 Factor3
## city_development_index -0.343           0.437 
## gender                                        
## relevent_experience     0.130   0.762   0.111 
## enrolled_university    -0.122  -0.368   0.124 
## education_level                         0.118 
## major_discipline                       -0.226 
## experience                      0.168  -0.327 
## company_size            0.102   0.229         
## last_new_job                    0.379  -0.151 
## training_hours                                
## target                  0.997                 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.170   0.954   0.422
## Proportion Var   0.106   0.087   0.038
## Cumulative Var   0.106   0.193   0.231
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(factosys$loadings, factosys$uniquenesses)
##                             Factor1      Factor2      Factor3          
## city_development_index -0.343198908 -0.080846443  0.436506244 0.6851434
## gender                 -0.006668926  0.012987558 -0.046550120 0.9976168
## relevent_experience     0.130161918  0.762190626  0.111095547 0.3897815
## enrolled_university    -0.121909998 -0.368158317  0.124301108 0.8341446
## education_level        -0.084114978  0.077107642  0.118282651 0.9729891
## major_discipline        0.013587355  0.030616715 -0.225659161 0.9479514
## experience              0.070105669  0.167865929 -0.326775589 0.8601227
## company_size            0.101791249  0.228602392  0.082825361 0.9305236
## last_new_job            0.044875196  0.378766910 -0.150702098 0.8318011
## training_hours         -0.021626402 -0.009556137 -0.018994553 0.9990832
## target                  0.997494132 -0.002014260  0.001193804 0.0050000

The first factor loads almost entirely on target (loading ≈ 0.997), the second mostly on relevent_experience, and the third is most related to city_development_index.
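A quick sanity check on the factanal() output: for each variable, the communality (sum of squared loadings across factors) plus the uniqueness should be approximately 1. Using the numbers reported above for city_development_index:

```r
# Loadings of city_development_index on the three factors (from the output above)
l <- c(-0.343198908, -0.080846443, 0.436506244)
u <- 0.6851434                      # its uniqueness

communality <- sum(l^2)
communality + u                     # approximately 1, as expected
```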

factosys2 = factanal(data,factors =  3, rotation = "varimax", scores = "regression")
factosys2
## 
## Call:
## factanal(x = data, factors = 3, scores = "regression", rotation = "varimax")
## 
## Uniquenesses:
## city_development_index                 gender    relevent_experience 
##                  0.685                  0.998                  0.390 
##    enrolled_university        education_level       major_discipline 
##                  0.834                  0.973                  0.948 
##             experience           company_size           last_new_job 
##                  0.860                  0.931                  0.832 
##         training_hours                 target 
##                  0.999                  0.005 
## 
## Loadings:
##                        Factor1 Factor2 Factor3
## city_development_index -0.283          -0.480 
## gender                                        
## relevent_experience             0.780         
## enrolled_university            -0.362  -0.177 
## education_level                        -0.118 
## major_discipline                        0.228 
## experience                      0.134   0.349 
## company_size                    0.247         
## last_new_job                    0.360   0.195 
## training_hours                                
## target                  0.984   0.116   0.115 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.068   0.973   0.505
## Proportion Var   0.097   0.088   0.046
## Cumulative Var   0.097   0.186   0.231
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 381.76 on 25 degrees of freedom.
## The p-value is 1.67e-65
cbind(factosys2$loadings, factosys2$uniquenesses)
##                            Factor1      Factor2     Factor3          
## city_development_index -0.28326449 -0.066857440 -0.47974019 0.6851434
## gender                 -0.01303137  0.006317856  0.04658683 0.9976168
## relevent_experience     0.03988904  0.780056966 -0.01178520 0.3897815
## enrolled_university    -0.05931015 -0.362004784 -0.17688471 0.8341446
## education_level        -0.08096397  0.080505905 -0.11821733 0.9729891
## major_discipline       -0.01365783  0.004182644  0.22768405 0.9479514
## experience              0.01377505  0.133748076  0.34899536 0.8601227
## company_size            0.07890051  0.247431743 -0.04508652 0.9305236
## last_new_job           -0.02074055  0.360125061  0.19511291 0.8318011
## training_hours         -0.02201464 -0.014295825  0.01519208 0.9990832
## target                  0.98402823  0.116063458  0.11496846 0.0050000
barplot(factosys$loadings[,1], names=F, las=2, col="#6633CC", ylim = c(-1, 1))

barplot(factosys$loadings[,2], names=F, las=2, col="#6633CC", ylim = c(-1, 1))

barplot(factosys$loadings[,3], las=2, col="#6633CC", ylim = c(-1, 1))

barplot(factosys2$loadings[,1], names=F, las=2, col="#FF9966", ylim = c(-1, 1))

barplot(factosys2$loadings[,2], names=F, las=2, col="#FF9966", ylim = c(-1, 1))

barplot(factosys2$loadings[,3], las=2, col="#FF9966", ylim = c(-1, 1))

Clustering:

Clustering consists of grouping the observations of our dataset into similar groups. We will start with k-means using 5 centers by default.

data_scaled= scale(data) 
k1 = kmeans(data_scaled, centers = 5, nstart = 100) 
k1_cluster = k1$cluster
k1_centers = k1$centers

#representation 
barplot(table(k1_cluster), col="#9999CC", xlab = "clusters")

Now, let’s understand the centers:

bar1 = barplot(k1_centers[1,], las=2, col="#9999CC",ylim = c(-2,2),
               main= paste("Cluster 1: center (blue) and global center (pink)"))
points(bar1, y = apply(data_scaled, 2, quantile, 0.50),col="#E85D75", pch=19)

Now, we plot it in the second center.

bar2 = barplot(k1_centers[2,], las=2, col="#9999CC",ylim = c(-2,2),
               main= paste("Cluster 2: center (blue) and global center (pink)"))
points(bar2, y = apply(data_scaled, 2, quantile, 0.50),col="#E85D75", pch=19)

Clusplot:

fviz_cluster(k1, data = data_scaled, geom  = c("point"), ellipse.type = "norm")+
  theme_minimal()

With the silhouette method we can check the optimal number of clusters.

#fviz_nbclust(data_scaled, kmeans, method = 'silhouette')
#I have also tried to plot it using the silhouette function but it seems it takes a lot of capacity so I was not able to.
#fviz_nbclust(data_scaled, kmeans, method = 'wss')

My laptop is not powerful enough to run this since it is very power-consuming, so let us assume that the ideal number is 3.
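When the full silhouette computation is too heavy, one workaround is to approximate it on a random subsample; a sketch on synthetic data using cluster::silhouette() (the sample size and range of k are arbitrary choices):

```r
library(cluster)
set.seed(42)

# Synthetic stand-in: three well-separated groups in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# Average silhouette width for k = 2..6, computed on a subsample
idx <- sample(nrow(x), 100)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x[idx, ], centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(x[idx, ]))[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]
best_k
```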

k2 = kmeans(data_scaled, centers = 3, nstart = 100)
k2_clusters = k2$cluster
k2_centers = k2$centers

barplot(table(k2_clusters), col="#9999CC", xlab = "clusters")

b2 = barplot(k2_centers[3,], las = 2, col="#FFC15E", ylim = c(-2,2),
             main = "Cluster 3: center (orange) and global center (red)")
points(b2, y = apply(data_scaled, 2, quantile, 0.5), col ="#BF4E30", pch = 20)

fviz_cluster(k2, data = data_scaled, geom  = c("point"), ellipse.type = "norm")+
  theme_minimal()

Minkowski method:

If we use this other plot, we have less overlapping:

#k2_min2 = eclust(data_scaled, "kmeans", stand=T, k=3, graph = T, hc_metric = "minkowski")

Either way, let’s try with 2 clusters:

#k2_min2 = eclust(data_scaled, "kmeans", stand=T, k=2, graph = T, hc_metric = "minkowski")

This does not run when knitting since it produces an out-of-memory error (several GBs required), but it did run locally, and with 2 clusters the overlap was basically non-existent.

Profile variables:

Profile variables are variables not used to build the clusters, which we inspect afterwards to characterize them. Here we will look at three variables (company_size, training_hours and education_level) across 3 clusters:

fit = kmeans(data_scaled, centers=3, nstart=100)
groups = fit$cluster

# We consider first the training_hours

as.data.frame(data_scaled) %>% mutate(cluster=factor(groups), company_size=company_size, min=training_hours, education_level = education_level) %>%
  ggplot(aes(x = cluster, y = min)) + 
  geom_boxplot(fill="lightblue") +
  labs(title = "training hours by cluster", x = "", y = "", col = "")

We can see that all 3 clusters have a lot of outliers. Let's now look at company size:

as.data.frame(data_scaled) %>% mutate(cluster=factor(groups), company_size=company_size, min=training_hours, education_level = education_level) %>%
  ggplot(aes(x = cluster, y = company_size)) + 
  geom_boxplot(fill="lightblue") +
  labs(title = " company_size  by cluster", x = "", y = "", col = "")

We see that the first cluster is actually pretty good; the other two have the median quite skewed, so they are not as good.

fviz_cluster(fit, data = data_scaled, geom  = c("point"), ellipse.type = "norm")+
  theme_dark()
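As a numeric complement to the boxplots, the per-cluster means of the profiling variables can be tabulated directly with aggregate(); a sketch on synthetic data (the column names mirror the real ones, but the values are made up):

```r
set.seed(7)
# Synthetic stand-in for the scaled profiling variables plus cluster labels
toy <- data.frame(training_hours  = rnorm(90),
                  company_size    = rnorm(90),
                  education_level = rnorm(90))
groups <- sample(1:3, 90, replace = TRUE)

# Mean of each profiling variable within each cluster
aggregate(toy, by = list(cluster = groups), FUN = mean)
```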

Mahalanobis distance

The last approach we try is the Mahalanobis distance. This method measures the distance between a point and a distribution, taking the correlations between dimensions into account. Since kmeans() itself only uses Euclidean distance, the usual trick is to whiten the data first (multiply by the inverse Cholesky factor of the covariance matrix): Euclidean distance on the whitened data equals Mahalanobis distance on the original data. Let us try it with 3 centers:

S = cov(data_scaled)
data_mah = data_scaled %*% solve(chol(S))  # whitened data
colnames(data_mah) = colnames(data_scaled)

fit.mahalanobis = kmeans(data_mah, centers=3, nstart=100)
groups_mah = fit.mahalanobis$cluster
centers_mah = fit.mahalanobis$centers

fviz_cluster(fit.mahalanobis, data = data_mah, geom = c("point"), ellipse.type = 'norm', pointsize=1)+
  theme_minimal()

This method is less power-consuming and has a neat graphical interpretation (basically the same as the first one); both methods are valid. The only one I would leave behind is the Minkowski one.

Conclusion:

Basically, studying human behavior is always tricky: we are often unpredictable about our desires, and our future is always uncertain. Thus, it is hard to know why people stay at or leave a company based on their background. Still, it was interesting to see the relations that appear (whether they are coincidences or not).

We have also learned about the different methods to use on a large dataset (as in this case), and that some of them are very time- and power-consuming.